
How do you add or remove a handle from an active Wait­For­Multiple­Objects?, part 2


Last time, we looked at adding or removing a handle from an active Wait­For­Multiple­Objects, and we developed an asynchronous mechanism that requests that the changes be made soon. But asynchronous add/remove can be a problem because you might remove a handle and clean up the things the handle depended upon, only to then receive a notification that the handle you removed has been signaled, even though you already cleaned up the things it depended on.

What we can do is wait for the waiting thread to acknowledge the operation.

_Guarded_by_(desiredMutex) DWORD desiredCounter = 1;
DWORD activeCounter = 0;

void wait_until_active(DWORD value)
{
    DWORD current = activeCounter;
    while (static_cast<int>(current - value) < 0) {
        WaitOnAddress(&activeCounter, &current,
                      sizeof(activeCounter), INFINITE);
        current = activeCounter;
    }
}

The wait_until_active function waits until the value of active­Counter is at least as large as value. We do this by subtracting the two values, to avoid wraparound problems.¹ The comparison takes advantage of the guarantee in C++20 that conversion from an unsigned integer to a signed integer converts to the value that is numerically equal modulo 2ⁿ where n is the number of bits in the destination. (Prior to C++20, the result was implementation-defined, but in practice all modern implementations did what C++20 mandates.)²

You can also use std::atomic:

_Guarded_by_(desiredMutex) DWORD desiredCounter = 1;
std::atomic<DWORD> activeCounter;

void wait_until_active(DWORD value)
{
    DWORD current = activeCounter;
    while (static_cast<int>(current - value) < 0) {
        activeCounter.wait(current);
        current = activeCounter;
    }
}

As before, the background thread manipulates the desiredHandles and desiredActions, then signals the waiting thread to wake up and process the changes. But this time, the background thread blocks until the waiting thread acknowledges the changes.

// Warning: For expository purposes. Almost no error checking.
void waiting_thread()
{
    bool update = true;
    std::vector<wil::unique_handle> handles;
    std::vector<std::function<void()>> actions;

    while (true)
    {
        if (std::exchange(update, false)) {
            std::lock_guard guard(desiredMutex);

            handles.clear();
            handles.reserve(desiredHandles.size() + 1);
            std::transform(desiredHandles.begin(), desiredHandles.end(),
                std::back_inserter(handles),
                [](auto&& h) { return duplicate_handle(h.get()); });
            // Add the bonus "changed" handle
            handles.emplace_back(duplicate_handle(changed.get()));

            actions = desiredActions;

            if (activeCounter != desiredCounter) {
                activeCounter = desiredCounter;   
                WakeByAddressAll(&activeCounter); 
            }
        }

        auto count = static_cast<DWORD>(handles.size());
                        
        // handles[0].addressof() points at the first raw HANDLE in the
        // vector; wil::unique_handle is layout-compatible with HANDLE.
        auto result = WaitForMultipleObjects(count,
                        handles[0].addressof(), FALSE, INFINITE);
        auto index = result - WAIT_OBJECT_0;
        if (index == count - 1) {
            // the list changed. Loop back to update.
            update = true;
            continue;
        } else if (index < count - 1) {
            actions[index]();
        } else {
            // deal with unexpected result
        }
    }
}

void change_handle_list()
{
    DWORD value;
    {
        std::lock_guard guard(desiredMutex);
        ⟦ make changes to desiredHandles and desiredActions ⟧
        value = ++desiredCounter;
        SetEvent(changed.get());
    }
    wait_until_active(value);
}

The pattern is that after a background thread makes its desired changes, it increments the desiredCounter and signals the event. It’s okay if multiple threads make changes before the waiting thread wakes up. The changes simply accumulate, and the event just stays signaled. Each background thread then waits for the waiting thread to process the change.

On the waiting side, we process changes as usual, but we also publish our current change counter if it has changed, to let the background threads know that we made some progress. Eventually, we will make enough progress that all of the pending changes have been processed, and the last background thread will be released from wait_until_active.

¹ You’ll run into problems if the counter increments 2 billion times without the worker thread noticing. At a thousand increments per second, that’ll last you a month. I figure that if you have a worker thread that is unresponsive for that long, then you have bigger problems. But you can avoid even that problem by switching to a 64-bit integer, so that the overflow won’t happen before the sun is expected to turn into a red giant.

² The holdouts would be compilers for systems that are not two’s-complement.

The post How do you add or remove a handle from an active Wait­For­Multiple­Objects?, part 2 appeared first on The Old New Thing.


“The problem is Sam Altman”: OpenAI insiders don’t trust CEO


On the same day that OpenAI released policy recommendations to ensure that AI benefits humanity if superintelligence is ever achieved, The New Yorker dropped a massive investigation into whether CEO Sam Altman can be trusted to actually follow through on OpenAI's biggest promises.

Parsing the publications side by side can be disorienting.

On the one hand, OpenAI said it plans to push for policies to "keep people first" as AI starts "outperforming the smartest humans even when they are assisted by AI." To achieve this, the company vows to remain "clear-eyed" and transparent about risks, which it acknowledged includes monitoring for extreme scenarios like AI systems evading human control or governments deploying AI to undermine democracy. Without proper mitigation of such risks, "people will be harmed," OpenAI warned, before describing how the company could be trusted to advocate for a future where achieving superintelligence means a "higher quality of life for all."

On the other hand, The New Yorker interviewed more than 100 people familiar with how Altman conducts business. The publication also reviewed internal memos and interviewed Altman more than 12 times. The resulting story provides a lengthy counterpoint explaining why the public may struggle to trust OpenAI's CEO to "control the future" of AI, no matter how rosy the company's vision may appear.

Overall, insiders painted Altman as a people-pleaser who tells others what they want to hear while questing for power in an alleged bid to always put himself first. As one board member summed up Altman, he has "two traits that are almost never seen in the same person. The first is a strong desire to please people, to be liked in any given interaction. The second is almost a sociopathic lack of concern for the consequences that may come from deceiving someone."

While The New Yorker found no "smoking gun," its reporters reviewed messages from OpenAI's former chief scientist, Ilya Sutskever, and former research head, Dario Amodei, that documented "an accumulation of alleged deceptions and manipulations." Many of the incidents could be shrugged off individually, but when taken together, both men concluded that Altman was not fostering a safe environment for advanced AI, The New Yorker reported.

"The problem with OpenAI," Amodei wrote, "is Sam himself."

OpenAI's worried public is souring on AI

Altman either disputed claims in the story or else claimed to have forgotten about certain events. He also attributed some of his shifting narratives to the changing landscape of AI and admitted that he's been conflict-avoidant in the past.

But his seeming contradictions are getting harder to ignore as scrutiny of OpenAI intensifies amid growing government reliance on its models and lawsuits that label its tech as unsafe.

Perhaps most visibly to the public, Altman has recently shifted away from positioning OpenAI as a sort of savior blocking AI doomsday scenarios, instead adopting a "tone" of "ebullient optimism," The New Yorker reported.

The policy recommendations echo this at times. Discussing the recommendations—which include experimenting with shorter work weeks and creating a public wealth fund to share AI profits—OpenAI's chief global affairs officer, Chris Lehane, confirmed to The Wall Street Journal that the company is urgently concerned about negative public opinions about AI. While announcing their big ideas to spare humanity from AI dangers, OpenAI also promoted "a pilot program of fellowships and focused research grants of up to $100,000 and up to $1 million in API credits for work that builds on these and related policy ideas."

However, The New Yorker's report makes it easier to question whether the recommendations were rolled out to distract from mounting public fears about child safety, job displacement, or energy-guzzling data centers. One recent Harvard/MIT poll found that Americans' biggest concern is that powering AI will hurt their quality of life, Axios reported. Ultimately, these concerns might sway votes for Democrats and Republicans ahead of the midterm elections, the WSJ noted, as data center moratoriums that could slow AI advancement are gaining traction.

For Altman and his company, getting the public to buy into their vision of AI at this critical juncture likely feels essential, since a loss of Republican control of Congress could pave the way for stricter AI safety laws, which The New Yorker noted Altman has privately lobbied against.

Without trust in Altman, it's likely a much harder sell to convince the public that OpenAI isn't simply saying whatever it will take to entrench its own dominance, The New Yorker suggested.

What exactly is OpenAI pitching?

"We don’t have all, or even most of the answers," OpenAI said. Instead, the company characterized its "industrial policy for the intelligence age" as "initial ideas for an industrial policy agenda to keep people first during the transition to superintelligence."

Calling for "common-sense" regulations and a public-private partnership to quickly iterate on successes, OpenAI pitched "ambitious" policy ideas to ensure that everyone can access AI and profit from it. Its bushy-tailed vision acknowledged that it hopes to achieve what society never did: guarantee Internet access and ensure AI is "fairly deployed" across the US, with everyone trained to use it.

Worker protections are a focus of OpenAI's plan. Recommendations included involving workers in discussions on how AI systems work to improve productivity and make workplaces safer, as well as on how to "set clear limits on harmful uses of AI." OpenAI also suggested creating a tax on automated labor that could be used to fund core programs like Social Security, Medicaid, SNAP, and housing assistance as companies rely less on human labor. Among other enticing ideas was a plan to "incentivize employers and unions to run time-bound 32-hour/four-day workweek pilots with no loss in pay that hold output and service levels constant, then convert reclaimed hours into a permanent shorter week, bankable paid time off, or both."

Additionally, OpenAI proposed a "public wealth fund" that "provides every citizen—including those not invested in financial markets—with a stake in AI-driven economic growth."

"Returns from the Fund could be distributed directly to citizens, allowing more people to participate directly in the upside of AI-driven growth, regardless of their starting wealth or access to capital," OpenAI said.

As AI takes on more tasks, humans can gravitate toward care-centric work, OpenAI suggested, recommending policy ideas to help displaced workers get training to work in health care, elderly care, daycare, or community service settings. To ensure people are attracted to those roles—historically undervalued as women's work—OpenAI suggested initiatives to help society recognize that caregiving is "economically valuable work."

Human workers will also be needed to use AI to accelerate scientific advancements, OpenAI said.

However, all these public benefits that OpenAI promises can only be realized if we build a "resilient society" that can quickly respond to risky implementations and "keep AI safe, governable, and aligned with democratic values," the company said.

That aspect of OpenAI's vision requires firms like OpenAI to develop safety systems, among other efforts, that will help improve public trust in AI. And we should trust those systems will work and only interfere with these firms when actual dangers are looming, OpenAI seems to suggest.

"As we progress toward superintelligence, there may come a point where a narrow set of highly capable models—particularly those that could materially advance chemical, biological, radiological, nuclear, or cyber risks—require stronger controls," OpenAI said.

When that day arrives, OpenAI opined, there should be a global network in place to communicate emerging risks. However, only the firms with the most advanced models should be subjected to rigorous audits, so that smaller firms can still compete. That's the path to ensure no firm's dominant position can be abused to unfairly shut down rivals or weaken democratic values, OpenAI said, while insisting that public input is vital to AI's success.

Altman has previously persuaded "a tech-skeptical public that their priorities, even when mutually exclusive, are also his priorities," The New Yorker reported. But for the public, which is already reporting alleged harms from OpenAI models, it might be getting harder to entertain lofty ideas from a company that is led by "the greatest pitchman of his generation," The New Yorker reported.

One OpenAI researcher told The New Yorker that Altman's promises can sometimes seem like a stopgap to overcome criticism until he reaches the next benchmark. When it comes to superintelligence, some optimistic experts think it could take two years, which is longer than Elon Musk stayed at OpenAI before famously criticizing Altman's leadership and leaving to start his own AI firm.

Altman "sets up structures that, on paper, constrain him in the future," the OpenAI researcher told The New Yorker. "But then, when the future comes and it comes time to be constrained, he does away with whatever the structure was."


Testing suggests Google's AI Overviews tell millions of lies per hour


Looking up information on Google today means confronting AI Overviews, the Gemini-powered search robot that appears at the top of the results page. AI Overviews has had a rough time since its 2024 launch, attracting user ire over its scattershot accuracy, but it's getting better and usually provides the right answer. That's a low bar, though. A new analysis from The New York Times attempted to assess the accuracy of AI Overviews, finding it's right 90 percent of the time. The flip side is that 1 in 10 AI answers is wrong, and for Google, that means hundreds of thousands of lies going out every minute of the day.

The Times conducted this analysis with the help of a startup called Oumi, which itself is deeply involved in developing AI models. The company used AI tools to probe AI Overviews with the SimpleQA evaluation, a common test to rank the factuality of generative models like Gemini. Released by OpenAI in 2024, SimpleQA is essentially a list of more than 4,000 questions with verifiable answers that can be fed into an AI.

Oumi began running its test last year when Gemini 2.5 was still the company's best model. At the time, the benchmark showed an 85 percent accuracy rate. When the test was rerun following the Gemini 3 update, AI Overviews answered 91 percent of the questions correctly. If you extrapolate this miss rate out to all Google searches, AI Overviews is generating tens of millions of incorrect answers per day.

The report includes several examples of where AI Overviews went wrong. When asked for the date on which Bob Marley's former home became a museum, AI Overviews cited three pages, two of which didn't discuss the date at all. The final one, Wikipedia, listed two contradictory years, and AI Overviews confidently chose the wrong one. The benchmark also prompts models to produce the date on which Yo Yo Ma was inducted into the classical music hall of fame. While AI Overviews cited the organization's website that listed Ma's induction, it claimed there's no such thing as the Classical Music Hall of Fame.

Google doesn't much like this test. Google spokesperson Ned Adriance tells the Times that Google believes SimpleQA contains incorrect information. Its model evaluations often rely on a similar test called SimpleQA Verified, which uses a smaller set of questions that have been more thoroughly vetted. "This study has serious holes," Adriance told the Times. "It doesn’t reflect what people are actually searching on Google."

Benchmark problems

Evaluating new AI models sometimes feels more like art than science, which is part of the problem. Every company has its own preferred way of demonstrating what a model can do, and the non-deterministic nature of gen AI can make it hard to verify anything. These robots can get a factual question right and then completely miss it if you rerun the query immediately. Oumi even uses AI tools to run its assessments, and those models can hallucinate, too.

The other wrinkle is that AI Overviews isn't a single monolithic model. Google told Ars Technica that it uses the "right model" for each query. While AI Overviews would get the best answers from always running Gemini 3.1 Pro, that's slow and expensive. To load things promptly on a search page, the overview uses faster Gemini Flash models when possible (which appears to be most of the time).

Google's response to this report is telling. In the realm of AI factuality, 9 out of 10 isn't even that bad. Google has recently published benchmarks for new model releases featuring measurements of factuality in the range of 60 to 80 percent—these tests are run without tools like web search. Grounding an AI with more data, like the wealth of human knowledge on the Internet, does make it more accurate than the naked model itself. However, the truth is in the blue links somewhere, and AI Overviews encourages people to accept its sometimes inaccurate summaries instead of checking those sources manually.

While Google says the Times' results don't match what people see, you have to wonder how the company could even know that. You've probably seen mistakes in AI Overviews—we all have because that's just how generative AI works. As Google itself reminds you at the bottom of every overview: "AI can make mistakes, so double-check responses."


YouTube increases Premium price again, says 90-second unskippable ads are a bug


Over the years, YouTube has evolved from a source of Rickrolls and cat videos to a platform for some of the Internet's most popular streaming content. Today, it costs more than ever to see that content, as YouTube has announced another price increase for its Premium service. Viewers who can't stomach the cost of Premium will be greeted by increasingly lengthy ad breaks, but YouTube says some of that is due to a bug it's now addressing.

YouTube has not posted a standalone blog announcing the change, but existing subscribers are getting email alerts. The higher pricing is also live for new sign-ups in the US as of this writing. Here's the important part of YouTube's email alerts:

To continue delivering great service and features, we’re increasing your price to $15.99/month. We don’t make these decisions lightly, but this update will allow us to continue to improve Premium and support the creators and artists you watch on YouTube.

You will see the change reflected on your June 7, 2026 billing date.

The new $15.99 monthly price is a $2 increase, but if you're on the family plan, the email looks a bit different. For those folks, the price is now $26.99, which is $4 higher. There's also the base Premium Lite subscription that removes most YouTube ads and used to cost $7.99 per month. It's now $1 more.

YouTube's subscription tier initially launched in 2015 as YouTube Red at $9.99 per month for the individual plan. In 2018, it morphed into YouTube Premium with a higher $11.99 cost. Then came the 2023 price hike to $13.99. This is the first US price increase for YouTube Premium since 2023, but many international viewers saw increases in 2024.

YouTube isn't alone—streaming prices continue their inexorable climb across the board. Netflix seemingly can't go a year without boosting prices, with the most recent increase coming just last month. Meanwhile, Amazon Prime Video is raising prices and removing features from its lower-tier plans. In unrelated news, Internet piracy rates are rising worldwide.

Pay with your wallet or your attention

Unlike with most streaming services, those who can't stomach YouTube's latest price increase have an option. Free users can browse and stream as many YouTube videos as they want, but they'll have to contend with ads. After earning more than $40 billion in ad revenue in 2025, the site expanded the use of unskippable 30-second ads in the TV app this year. Previously, the longest you'd have to wait before getting back to your video was 15 seconds.

But viewers have increasingly pointed to even longer ad breaks. In recent days, reports of 90-second unskippable ads have proliferated. The company has responded to the kerfuffle, saying, "YouTube does not have a 90-second non-skippable ad format. This isn’t something we are testing right now." The company's post on X has since been "community noted" to reaffirm the existence of 90-second unskippable ads.


Despite YouTube's assurances, many, many viewers report seeing these longer ads, and there are several images that appear to show unskippable 90-second ad breaks. YouTube users have accused the company of lying or using deceptive language in its denial.

Some viewers report that these extra-long breaks are a mix of ad types. They begin with a 30-second unskippable ad, and the player then rolls into a few shorter skippable ads. However, the interface only shows the standard "Skip in" text with a countdown until all the ads are over. The good news is that this is an error, and YouTube is working on it.

The YouTube interface makes this look like an unskippable 90-second ad even if it's not. Credit: /u/Ok_Neat1652

YouTube now says it has determined these longer unskippable ads are an interface bug. "We’ve determined this was a result of a bug, which resulted in higher, inaccurate timers being shown for shorter ads," a company spokesperson said. "We’re rolling out a fix now. As we’ve said, we don’t have a 90 second non-skippable ad format and this was not a test."

YouTube just isn't the streaming video free-for-all it once was. You'll have to pay in one way or another if you want to watch YouTube content. The site will either take an ever larger bite of your budget, or you'll have to sit through more ads than ever before. There are alternative YouTube clients that can strip out ads, and ad-blockers can do the same on the web. However, it's a cat-and-mouse game as YouTube works to block the blockers.


Report: US demands Reddit unmask ICE critic, summons firm to grand jury


The Trump administration has stepped up an effort to unmask a Reddit user who criticized Immigration and Customs Enforcement (ICE). After failing to obtain information through a summons issued to Reddit, the government reportedly issued a subpoena demanding that Reddit provide the information and appear before a grand jury in Washington, DC.

The Intercept described the subpoena today. "According to a subpoena obtained by The Intercept, Reddit has until April 14 to provide a wide range of personal data on one of its users, whom US Immigration and Customs Enforcement agents have been trying unsuccessfully to identify for more than a month," the article said.

The legal saga began in US District Court for the Northern District of California. On March 12, the anonymous Reddit user whose information is being sought filed a motion to quash a summons seeking a host of information from Reddit. The summons was issued by the Department of Homeland Security and directed Reddit to turn information over to an ICE senior special agent.

The summons cited authority under 19 U.S. Code § 1509, which is part of the Smoot-Hawley Tariff Act of 1930. The motion to quash said the summons is not authorized by the law, which deals with imports of boats, alcoholic drinks, and animals, among other things.

"J. Doe is a US citizen who has not traveled out of the country, is not engaged in any international commerce, has no business concerns outside the United States, and primarily uses their Reddit account to engage in political speech relevant to their local community," said the filing by the Civil Liberties Defense Center (CLDC), which represents the Reddit user. "Yet the government claims the right to obtain Doe’s name, telephone number, home address, banking and credit card information, IP addresses, telephone model number(s), and the names of any other accounts associated with their Reddit account. The information sought by the government in no way pertains to customs or importing or exporting merchandise, and is clearly intended to chill free speech."

No "criminal activity or intent" in user's posts

The Trump administration has accused ICE critics of doxxing agents in some cases. But the copy of the summons to Reddit available on the court docket doesn't identify any specific posts made by the Reddit user, who lives in Oregon.

"When John Doe’s attorneys later reviewed their Reddit posts, they found nothing to suggest criminal activity or intent," The Intercept article said. In one instance, Doe commented in response to a Minnesota Star Tribune article in January 2026 about Jonathan Ross, the ICE officer who fatally shot Renée Good in Minneapolis.

"John Doe responded by sharing that Ross had lived in Chaska, Minnesota; grew up in Indiana; and served in the Indiana National Guard—biographical details that were circulating widely at the time," and wrote that “Hopefully he moves up to Stillwater State Penitentiary," according to The Intercept.

On a different occasion, Doe suggested that another Reddit user write "Urine speaks louder than words" on an anti-ICE protest sign. “TSA sucks and we all know it," Doe wrote in another comment thread.

Doe submitted a declaration stating, "I utilize this account to engage in political speech through direct posts, as well as dialogue with community members in comment threads associated with my own and others’ posts. Reddit allows users to post and engage on the service without publicly disclosing their legal identity. Like many other users on the site, I use Reddit to converse anonymously."

Case ended, but then Reddit was subpoenaed

The dispute seemed to be over in late March when the Department of Homeland Security (DHS) rescinded the summons and notified Reddit of this decision. The proceeding in the California court was dismissed at Doe's request.

But on March 31, "Reddit received another message from the feds," The Intercept reported. "This time, instead of requesting information on an individual user, the government ordered Reddit itself to appear before a grand jury—not in California, but in Washington."

The subpoena was issued by prosecutors from the US Attorney's office in DC, and the "records sought spanned a period roughly three times longer than what ICE had originally requested," the article said. The US Attorney for the District of Columbia is Jeanine Pirro. The grand jury subpoena is a new tactic being used by the Trump administration after it repeatedly lost attempts to subpoena information in court, The Intercept was told by CLDC Executive Director Lauren Regan.

Grand jury proceedings are not public. Grand juries may issue indictments after assessing evidence presented by prosecutors to determine whether there is probable cause that someone committed a crime. Witnesses may be called to give testimony. If an indictment is issued, the accused would be put on trial.

"The only valid use of a grand jury is to investigate federal crimes,” Regan told The Intercept. It's unclear how Doe's Reddit posts are evidence of a crime, and the administration is "able to hide what they are doing under the guise of a federal grand jury," she said.

While the now-withdrawn summons is public, we do not have a copy of the subpoena. The CLDC told Ars today that it has no further comment on the case and noted that grand jury subpoenas are issued in secret.

Reddit: "We do not voluntarily share information with any government"

David Greene, senior counsel for the Electronic Frontier Foundation, "knew of no examples during the recent wave of immigration enforcement-related investigations in which a leading tech company has been called to appear before one of the secret panels," The Intercept article said. "Free speech protections are at their weakest in the context of a grand jury, he explained: The proceedings are not adversarial; their purpose is to permit a prosecutor to file charges."

“We should be very, very, very concerned that they’ve now taken one of these to a grand jury,” Greene was quoted as saying. “It’s something to be taken very seriously.”

A Reddit spokesperson told Ars today that "we seek to inform users of any legal process compelling disclosure of their data, as we did in this case, because users should have the agency to protect their own information and are often better positioned to challenge requests that impact them."

Reddit didn't provide any details on the subpoena but said, "We do not voluntarily share information with any government, especially not on users exercising their rights to criticize the government or plan a protest. We review every inquiry for legal sufficiency and routinely object to requests that are overbroad or threaten civil rights. When legally compelled to disclose data, we provide only the minimum required and notify the user whenever possible so they can defend their interests.”

We contacted the US attorney's office in DC, the DHS, and ICE and will update this article if we get any new information.

Disclosure: Advance Publications, which owns Ars Technica parent Condé Nast, is the largest shareholder in Reddit.


AI models are terrible at betting on soccer—especially xAI Grok


AI models from Google, OpenAI, and Anthropic lost money betting on soccer matches over a Premier League season, in a new study suggesting even the most advanced systems struggle to analyze the real world over long periods.

The “KellyBench” report released this week by AI start-up General Reasoning highlights the gap between AI’s rapidly advancing capabilities in certain tasks, such as writing software, and its shortcomings in other kinds of human problems.

London-based General Reasoning tested eight top AI systems in a virtual re-creation of the 2023–24 Premier League season, providing them with detailed historical data and statistics about each team and previous games. The AIs were instructed to build models that would maximize returns and manage risk.

The AI “agents” then placed bets on the outcomes of matches and the number of goals scored to test how they could adapt to new events and updated player data as the season progressed.

The AI could not access the Internet to retrieve results and each was given three attempts to turn a profit.

Anthropic’s Claude Opus 4.6 fared best, with an average loss of 11 percent and nearly breaking even on one attempt.

xAI’s Grok 4.20 went bankrupt once and failed to complete the other two tries. Google’s Gemini 3.1 Pro managed to turn a 34 percent profit on one go but went bankrupt on another.

“Every frontier model we evaluated lost money over the season and many experienced ruin,” the authors of the paper concluded, with the AI “systematically underperforming humans” in this scenario.

AI Model                   | Mean ROI | Best try | Worst try | Mean final bankroll
Anthropic Claude Opus 4.6  | –11.0%   | –0.2%    | –18.8%    | £89,035
OpenAI GPT-5.4             | –13.6%   | –4.1%    | –31.6%    | £86,365
Google Gemini 3.1 Pro      | –43.3%   | +33.7%   | –100.0%   | £56,715
Google Gemini Flash 3.1 LP | –58.4%   | +24.7%   | –100.0%   | £41,605
Z.AI GLM-5                 | –58.8%   | –14.3%   | –100.0%   | £41,221
Moonshot Kimi K2.5         | –68.3%   | –27.0%   | –100.0%   | £7,420
xAI Grok 4.20              | –100.0%  | –100.0%  | –100.0%   | £0
Acree Trinity              | –100.0%  | –100.0%  | –100.0%   | £0

Each model began with a £100,000 normalized bankroll. Return on investment and final bankroll are averaged across three tries. Grok and Trinity did not complete every attempt.

The results offer some comfort to white-collar professionals and businesses who are fretting that AI could take their jobs, as it roils the shares of industries from finance to marketing.

Ross Taylor, one of the study’s authors and General Reasoning’s chief executive, said: “There is so much hype about AI automation, but there’s not a lot of measurement of putting AI into a long time-horizon setting.”

He added that many of the benchmarks typically used to test AI are flawed because they are set in “very static environments” that bear little resemblance to the chaos and complexity of the real world.

General Reasoning’s paper, which has not yet been peer reviewed, provides a counterweight to growing excitement in Silicon Valley about the huge recent leaps in AI’s ability to complete computer programming tasks with little to no human intervention.

Taylor, a former Meta AI researcher, said: “If you... try AI on some real-world tasks, it does really badly... Yes, software engineering is very important and economically valuable, but there are lots of other activities with longer time horizons that are important to look at.”

© 2026 The Financial Times Ltd. All rights reserved. Not to be redistributed, copied, or modified in any way.
